Credit Card Fraud Detection Exploration and Analysis by Pengchong Tang

Introduction: This report explores a credit card fraud detection dataset from Kaggle.com. The dataset contains transactions made with credit cards in September 2013 by European cardholders. It covers transactions that occurred over two days, with 492 frauds out of 284,807 transactions. The dataset is highly unbalanced: the positive class (frauds) accounts for 0.172% of all transactions. It contains only numerical input variables, which are the result of a PCA transformation. Features V1, V2, ..., V28 are the principal components obtained with PCA; the only features that have not been transformed with PCA are 'Time' and 'Amount'. The feature 'Time' contains the seconds elapsed between each transaction and the first transaction in the dataset, and the feature 'Amount' is the transaction amount. The feature 'Class' is the response variable; it takes the value 1 in case of fraud and 0 otherwise.

Univariate Plots Section

Summary of the dataset:

##       Time              V1                  V2           
##  Min.   :     0   Min.   :-56.40751   Min.   :-72.71573  
##  1st Qu.: 54202   1st Qu.: -0.92037   1st Qu.: -0.59855  
##  Median : 84692   Median :  0.01811   Median :  0.06549  
##  Mean   : 94814   Mean   :  0.00000   Mean   :  0.00000  
##  3rd Qu.:139321   3rd Qu.:  1.31564   3rd Qu.:  0.80372  
##  Max.   :172792   Max.   :  2.45493   Max.   : 22.05773  
##        V3                 V4                 V5            
##  Min.   :-48.3256   Min.   :-5.68317   Min.   :-113.74331  
##  1st Qu.: -0.8904   1st Qu.:-0.84864   1st Qu.:  -0.69160  
##  Median :  0.1799   Median :-0.01985   Median :  -0.05434  
##  Mean   :  0.0000   Mean   : 0.00000   Mean   :   0.00000  
##  3rd Qu.:  1.0272   3rd Qu.: 0.74334   3rd Qu.:   0.61193  
##  Max.   :  9.3826   Max.   :16.87534   Max.   :  34.80167  
##        V6                 V7                 V8           
##  Min.   :-26.1605   Min.   :-43.5572   Min.   :-73.21672  
##  1st Qu.: -0.7683   1st Qu.: -0.5541   1st Qu.: -0.20863  
##  Median : -0.2742   Median :  0.0401   Median :  0.02236  
##  Mean   :  0.0000   Mean   :  0.0000   Mean   :  0.00000  
##  3rd Qu.:  0.3986   3rd Qu.:  0.5704   3rd Qu.:  0.32735  
##  Max.   : 73.3016   Max.   :120.5895   Max.   : 20.00721  
##        V9                 V10                 V11          
##  Min.   :-13.43407   Min.   :-24.58826   Min.   :-4.79747  
##  1st Qu.: -0.64310   1st Qu.: -0.53543   1st Qu.:-0.76249  
##  Median : -0.05143   Median : -0.09292   Median :-0.03276  
##  Mean   :  0.00000   Mean   :  0.00000   Mean   : 0.00000  
##  3rd Qu.:  0.59714   3rd Qu.:  0.45392   3rd Qu.: 0.73959  
##  Max.   : 15.59500   Max.   : 23.74514   Max.   :12.01891  
##       V12                V13                V14          
##  Min.   :-18.6837   Min.   :-5.79188   Min.   :-19.2143  
##  1st Qu.: -0.4056   1st Qu.:-0.64854   1st Qu.: -0.4256  
##  Median :  0.1400   Median :-0.01357   Median :  0.0506  
##  Mean   :  0.0000   Mean   : 0.00000   Mean   :  0.0000  
##  3rd Qu.:  0.6182   3rd Qu.: 0.66251   3rd Qu.:  0.4931  
##  Max.   :  7.8484   Max.   : 7.12688   Max.   : 10.5268  
##       V15                V16                 V17           
##  Min.   :-4.49894   Min.   :-14.12985   Min.   :-25.16280  
##  1st Qu.:-0.58288   1st Qu.: -0.46804   1st Qu.: -0.48375  
##  Median : 0.04807   Median :  0.06641   Median : -0.06568  
##  Mean   : 0.00000   Mean   :  0.00000   Mean   :  0.00000  
##  3rd Qu.: 0.64882   3rd Qu.:  0.52330   3rd Qu.:  0.39968  
##  Max.   : 8.87774   Max.   : 17.31511   Max.   :  9.25353  
##       V18                 V19                 V20           
##  Min.   :-9.498746   Min.   :-7.213527   Min.   :-54.49772  
##  1st Qu.:-0.498850   1st Qu.:-0.456299   1st Qu.: -0.21172  
##  Median :-0.003636   Median : 0.003735   Median : -0.06248  
##  Mean   : 0.000000   Mean   : 0.000000   Mean   :  0.00000  
##  3rd Qu.: 0.500807   3rd Qu.: 0.458949   3rd Qu.:  0.13304  
##  Max.   : 5.041069   Max.   : 5.591971   Max.   : 39.42090  
##       V21                 V22                  V23           
##  Min.   :-34.83038   Min.   :-10.933144   Min.   :-44.80774  
##  1st Qu.: -0.22839   1st Qu.: -0.542350   1st Qu.: -0.16185  
##  Median : -0.02945   Median :  0.006782   Median : -0.01119  
##  Mean   :  0.00000   Mean   :  0.000000   Mean   :  0.00000  
##  3rd Qu.:  0.18638   3rd Qu.:  0.528554   3rd Qu.:  0.14764  
##  Max.   : 27.20284   Max.   : 10.503090   Max.   : 22.52841  
##       V24                V25                 V26          
##  Min.   :-2.83663   Min.   :-10.29540   Min.   :-2.60455  
##  1st Qu.:-0.35459   1st Qu.: -0.31715   1st Qu.:-0.32698  
##  Median : 0.04098   Median :  0.01659   Median :-0.05214  
##  Mean   : 0.00000   Mean   :  0.00000   Mean   : 0.00000  
##  3rd Qu.: 0.43953   3rd Qu.:  0.35072   3rd Qu.: 0.24095  
##  Max.   : 4.58455   Max.   :  7.51959   Max.   : 3.51735  
##       V27                  V28                Amount         Class     
##  Min.   :-22.565679   Min.   :-15.43008   Min.   :    0.00   0:284315  
##  1st Qu.: -0.070840   1st Qu.: -0.05296   1st Qu.:    5.60   1:   492  
##  Median :  0.001342   Median :  0.01124   Median :   22.00             
##  Mean   :  0.000000   Mean   :  0.00000   Mean   :   88.35             
##  3rd Qu.:  0.091045   3rd Qu.:  0.07828   3rd Qu.:   77.17             
##  Max.   : 31.612198   Max.   : 33.84781   Max.   :25691.16

The raw dataset consists of 284,807 transaction records, of which 492 are fraudulent. There are no missing values in the dataset. I also found 1,081 duplicate records; these duplicates will be removed before creating the t-SNE plot, since duplicate rows cause errors in the t-SNE algorithm.
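As a minimal sketch of these checks (assuming the Kaggle CSV has been read into a data frame named creditcard, as used throughout this report):

    # Load the data and treat Class as a factor, matching the summary above
    creditcard <- read.csv("creditcard.csv")
    creditcard$Class <- factor(creditcard$Class)

    sum(is.na(creditcard))       # 0 -> no missing values
    sum(duplicated(creditcard))  # number of fully duplicated rows (1081 here)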

Univariate Analysis

Explore the Class:

The dataset is highly imbalanced: fraudulent records account for only 0.172% of all transactions.
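A quick way to confirm the imbalance (again assuming the creditcard data frame from above):

    table(creditcard$Class)              # 0: 284315, 1: 492
    prop.table(table(creditcard$Class))  # fraud share is roughly 0.00172, i.e. 0.172%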

Explore the Time:

Histogram of Time per minute

Histogram of Time per hour

The density of Time

The largest Time value is 172,792 seconds, which is roughly 48 hours. There appear to be two peaks and two troughs across the two days. I assume the peaks occur during the daytime and the troughs occur at night. I wonder if I can transform Time into Hour, a categorical variable representing the hour of the day, assuming the time starts at 12:00 AM.
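A minimal sketch of that transformation (assuming the creditcard data frame from above and that the first transaction occurs at 12:00 AM):

    # Hour of day (0-23), derived from seconds elapsed since the first transaction
    creditcard$Hour <- factor(floor(creditcard$Time / 3600) %% 24)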

The period 9:00-22:00 is the rush time, when most transactions are committed.

Explore the Amount:

Histogram of Amount

The distribution of Amount is highly skewed. After plotting on a log scale, it shows a roughly normal, bimodal shape.
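A sketch of the log-scale view with ggplot2; the +1 offset (to keep zero amounts defined) and the Amount_A name, which matches the column listed in the pairs plot later, are my assumptions:

    library(ggplot2)

    # Log-transformed amount; +1 keeps the zero-amount transactions in the plot
    creditcard$Amount_A <- log10(creditcard$Amount + 1)

    ggplot(creditcard, aes(x = Amount_A)) +
      geom_histogram(bins = 50) +
      labs(x = "log10(Amount + 1)", y = "Count")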

Explore V1-V28

Let’s plot the histograms of V1-V28.

Boxplot of V1-V28

Can’t see the box? Let’s make another boxplot of V1-V28 with most outliers removed.

The plots show that most distributions have low skewness and zero mean; some have high kurtosis (e.g. V28), and some are close to normal (e.g. V13). V1 is highly left-skewed, so I apply the transformation log10(-x + 3) to pull in the long tail.
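A minimal sketch of this transformation (the V1_A name matches the column listed in the pairs plot later; 3 - V1 is always positive because the maximum of V1 is about 2.45):

    # Left-skewed V1: reflect and log-transform to pull in the long tail
    creditcard$V1_A <- log10(3 - creditcard$V1)

    ggplot(creditcard, aes(x = V1_A)) +
      geom_histogram(bins = 100)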

This plot shows V1 after the transformation. Three peaks appear where the data cluster.

What is the structure of your dataset?

The dataset contains 284,807 transaction records over two days, ordered by Time. Fraudulent transactions account for only 0.172% of all transactions. The median and mean transaction amounts are both less than 100, and the maximum amount is 25,691.16. V1-V28 are zero-mean variables that are either highly skewed or have high kurtosis.

What is/are the main feature(s) of interest in your dataset?

The main feature of interest is Class, along with all the independent variables that might be useful for predicting fraud.

What other features in the dataset do you think will help support your  investigation into your feature(s) of interest?

Time may help detect fraud. I wonder whether frauds have a different time distribution than normal transactions, for example whether more frauds occur at night.

Did you create any new variables from existing variables in the dataset?

Time counts the seconds elapsed between the current transaction and the first transaction, so I needed to transform it into something more meaningful than a raw count. The peak transaction times appear periodic with a 24-hour cycle, so I created a categorical Hour variable that extracts the hour of the day from the elapsed time, assuming the first transaction occurs at 12:00 AM.

Of the features you investigated, were there any unusual distributions?  Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

I log-transformed the left-skewed V1 and the right-skewed Amount to make the data easier to visualize. The transformed V1 shows a distribution with three peaks.

Bivariate Plots Section

Explore Time vs Class

The plots show that frauds have a different Time distribution than normal transactions. The number of frauds during the daytime is slightly higher than at night, but there is no significant drop at night.

Explore Hour vs Class

Clearly, frauds can occur at any time of day. However, time alone does not provide a rule to distinguish fraudulent from normal transactions, because there are still thousands of normal transactions at night.
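To quantify this, here is a sketch of the per-hour fraud rate using dplyr (assuming the creditcard data frame and the Hour variable from the earlier sketches):

    library(dplyr)

    creditcard %>%
      group_by(Hour) %>%
      summarise(n = n(),
                frauds = sum(Class == "1"),
                fraud_rate = frauds / n) %>%
      arrange(desc(fraud_rate))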

Explore Amount and transformed Amount vs Class

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    1.00    9.25  122.20  105.90 2126.00

It seems the distributions of transaction amounts for fraudulent and normal transactions are similar.

Explore V1-V28 vs Class

The density distribution of V1-V28 by Class

The plots show that the fraud distributions have lower kurtosis (they are flatter) than those of normal transactions. Some features have clearly different medians for frauds. I think the features whose density functions overlap less between the two classes can be useful for detecting fraud.
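A sketch of how the faceted densities can be produced with tidyr and ggplot2 (the column selection and free scales are my assumptions):

    library(dplyr)
    library(tidyr)
    library(ggplot2)

    creditcard %>%
      select(V1:V28, Class) %>%
      pivot_longer(-Class, names_to = "feature", values_to = "value") %>%
      ggplot(aes(x = value, colour = Class)) +
      geom_density() +
      facet_wrap(~ feature, scales = "free", ncol = 4)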

Explore Amount vs Hour

The plot shows that most transaction amounts are less than 200; the median amount varies around 20 over the course of a day.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. How did the feature(s) of interest vary with other features in
the dataset?

It seems frauds occur roughly uniformly throughout the day, regardless of day or night. Since the number of normal transactions drops at night, the probability that a given transaction is fraudulent increases slightly at night.

The smallest fraudulent amount is 0 and the largest is 2126. I don't see any specific amount range with a significantly higher probability of being a fraud.

The features V1-V28 seem more informative, because most of them show different distributions for fraudulent and normal transactions.

Did you observe any interesting relationships between the other features
(not the main feature(s) of interest)?

The median transaction amount during the day is higher than at night. Daytime transactions tend to be both more numerous and larger.

What was the strongest relationship you found?

The features V1-V7, V9-V12, V14, V16-V19, and V21 have clearly distinct density shapes across the two classes. I think these features are very important for detecting fraud.

Multivariate Plots Section

Amount vs Time by Class

The red points mark fraudulent transactions. They are always surrounded by white points, so no pattern emerges in which frauds behave differently from normal transactions in terms of amount. Amount may not be useful for fraud detection.

Time series plot V1-V28 by Class

Looking at the red points: where they are not surrounded by white points, or lie far away from them, a supervised learning model should be able to draw a boundary that separates the frauds. Based on the plots above, I would select the stronger features V1-V5, V7-V12, V14, and V16-V18, which most clearly separate the red points from the white point clusters.

A couple of features show a clear shift during specific times of day, e.g. V12 and V13. I am curious about the hours when this shift occurs, and I think Hour is a useful feature that should be included in the model.

Let’s explore V12 and V13.

The plots show that the shifts occur at the same time every day, from 1:00 to 7:00. Interestingly, the transactions whose V12 value 'forgets' to shift back to normal during the daytime are probably the ones regarded as frauds.

Pairs plot of all features by Class

##  [1] "Time"     "V1"       "V2"       "V3"       "V4"       "V5"      
##  [7] "V6"       "V7"       "V8"       "V9"       "V10"      "V11"     
## [13] "V12"      "V13"      "V14"      "V15"      "V16"      "V17"     
## [19] "V18"      "V19"      "V20"      "V21"      "V22"      "V23"     
## [25] "V24"      "V25"      "V26"      "V27"      "V28"      "Amount"  
## [31] "Class"    "Hour"     "Amount_A" "V1_A"

The image is very large, so I have saved a high-resolution version here.

The pairs plot shows that normal transactions have no significant correlations between features, whereas for fraudulent transactions some features are correlated.

Explore correlations

The plots show that the features are almost uncorrelated for normal transactions. For frauds, however, there are strong correlations among V1-V5, V7, V9-V12, V14, and V16-V19.
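A sketch of the two correlation matrices, one per class, plotted with the corrplot package (the package choice and styling are my assumptions):

    library(corrplot)

    vars <- paste0("V", 1:28)

    cor_normal <- cor(creditcard[creditcard$Class == "0", vars])
    cor_fraud  <- cor(creditcard[creditcard$Class == "1", vars])

    corrplot(cor_normal, method = "color", tl.cex = 0.5,
             title = "Normal transactions", mar = c(0, 0, 1, 0))
    corrplot(cor_fraud, method = "color", tl.cex = 0.5,
             title = "Fraudulent transactions", mar = c(0, 0, 1, 0))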

t-SNE plot

## Read the 10473 x 39 data matrix successfully!
## Using no_dims = 2, perplexity = 30.000000, and theta = 0.500000
## Computing input similarities...
## Normalizing input...
## Building tree...
##  - point 0 of 10473
##  - point 10000 of 10473
## Done in 8.19 seconds (sparsity = 0.012363)!
## Learning embedding...
## Iteration 50: error is 97.811603 (50 iterations in 4.80 seconds)
## Iteration 100: error is 88.920287 (50 iterations in 5.07 seconds)
## Iteration 150: error is 84.392067 (50 iterations in 4.91 seconds)
## Iteration 200: error is 83.664107 (50 iterations in 4.86 seconds)
## Iteration 250: error is 83.335895 (50 iterations in 4.86 seconds)
## Iteration 300: error is 3.075141 (50 iterations in 4.59 seconds)
## Iteration 350: error is 2.651152 (50 iterations in 4.62 seconds)
## Iteration 400: error is 2.411062 (50 iterations in 4.46 seconds)
## Iteration 450: error is 2.251608 (50 iterations in 4.53 seconds)
## Iteration 500: error is 2.136952 (50 iterations in 4.54 seconds)
## Iteration 550: error is 2.049831 (50 iterations in 4.58 seconds)
## Iteration 600: error is 1.981499 (50 iterations in 4.61 seconds)
## Iteration 650: error is 1.926512 (50 iterations in 4.56 seconds)
## Iteration 700: error is 1.882047 (50 iterations in 4.61 seconds)
## Iteration 750: error is 1.846762 (50 iterations in 4.61 seconds)
## Iteration 800: error is 1.818805 (50 iterations in 4.62 seconds)
## Iteration 850: error is 1.797712 (50 iterations in 4.74 seconds)
## Iteration 900: error is 1.781142 (50 iterations in 4.75 seconds)
## Iteration 950: error is 1.768946 (50 iterations in 4.78 seconds)
## Iteration 1000: error is 1.759527 (50 iterations in 5.54 seconds)
## Fitting performed in 94.64 seconds.

I chose the features V1-V5, V7, V9-V12, V14, V16-V18, and Hour for the t-SNE algorithm, since these features show the strongest fraud patterns. The t-SNE plot contains all fraud points and 10,000 sampled non-fraud points. It shows two major clusters of frauds (upper left and lower left), as well as other individual frauds whose features look very similar to those of normal transactions and are therefore hard to identify.
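A minimal sketch of such a run with the Rtsne package (the sampling, duplicate handling, scaling, and treating Hour as numeric are my simplifying assumptions, so this is not necessarily the exact code behind the log above):

    library(Rtsne)
    library(dplyr)
    library(ggplot2)

    set.seed(42)

    # Feature subset described above; Hour is treated as numeric here for simplicity
    feats <- c(paste0("V", c(1:5, 7, 9:12, 14, 16:18)), "Hour")

    dedup   <- creditcard[!duplicated(creditcard), ]     # Rtsne rejects exact duplicates
    frauds  <- dedup %>% filter(Class == "1")
    normals <- dedup %>% filter(Class == "0") %>% sample_n(10000)
    tsne_data <- bind_rows(frauds, normals)

    tsne_fit <- Rtsne(scale(data.matrix(tsne_data[, feats])),
                      dims = 2, perplexity = 30,
                      check_duplicates = FALSE)           # duplicates dropped above

    ggplot(data.frame(tsne_fit$Y, Class = tsne_data$Class),
           aes(x = X1, y = X2, colour = Class)) +
      geom_point(alpha = 0.5)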

Multivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. Were there features that strengthened each other in terms of
looking at your feature(s) of interest?

The time series plots of the features are most helpful for seeing how the transaction distributions vary during a day. I will select the most useful features, V1-V5, V7, V9-V12, V14, and V16-V18, for fraud detection.

From the correlation heat map, I see that some features are highly correlated, e.g. V16-V18. To avoid redundancy, some of the correlated features could be dropped from a model.

Were there any interesting or surprising interactions between features?

Features such as V12 and V13 show a periodic shift at 1:00-7:00 every day, and their distributions also change when the shift occurs.

OPTIONAL: Did you create any models with your dataset? Discuss the
strengths and limitations of your model.

Yes. I wrote a Python script that builds a baseline neural network model for fraud detection.

The model scores around 0.8 AUPRC and is able to detect about 80% of frauds without flagging many legitimate customers. However, pushing the detection rate above 80% is very difficult, because a huge number of customers would have to be inspected while only a few more frauds would be discovered.
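For reference, AUPRC can be computed in R with the PRROC package, given predicted fraud probabilities; this is only an illustrative sketch with hypothetical score vectors, not the Python script itself:

    library(PRROC)

    # Hypothetical predicted probabilities from a classifier:
    # scores_fraud for true frauds, scores_normal for true normal transactions
    pr <- pr.curve(scores.class0 = scores_fraud,
                   scores.class1 = scores_normal,
                   curve = TRUE)
    pr$auc.integral  # area under the precision-recall curve (AUPRC)
    plot(pr)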


Final Plots and Summary

Plot One

Description One

Plot one shows the transaction amounts over the two days; the red points are fraudulent transactions.

Plot Two

Description Two

Plot two indicates a distribution shift on V12 from 1:00 to 7:00.

Plot Three

Description Three

The t-SNE plot reduces the high-dimensional features to two dimensions. It shows two clusters of red points, which are fraudulent transactions.


Reflection

The credit card dataset contains two days of transactions, of which only 0.172% are fraudulent. I started by exploring individual features and the relationships among multiple features, and eventually selected the best features for a model. I also built a baseline model that is able to detect 80% of frauds without flagging many legitimate customers.

I struggled with selecting the features that best distinguish frauds. Some features are strongly correlated, but I don't have any background information besides Time and Amount to explain the correlations. I am still looking for high-dimensional visualization tools to better reveal any hidden fraud patterns across all features.

Because frauds are very rare, I use AUPRC as the metric to evaluate a model. My model achieves an average score of about 0.8 and detects about 80% of frauds, but I think it is very difficult to push much beyond this score. The remaining 20% of frauds are unfortunately well camouflaged: their V1-V28 values are all close to zero, the mean of normal transactions. Hence, I conclude that the existing features are not sufficient to uncover all frauds. Collecting more features and more transaction records on different days is recommended in order to build a better classification model.

Future work will investigate the fraudulent cases that the model fails to detect.